By the end of this section, you will be able to:
Visualization is a critical component of data analysis, especially in the context of big data. While visualization helps uncover patterns and insights, working with large datasets presents unique challenges such as overplotting, performance issues, and information overload. In this section, we’ll explore strategies and tools to effectively visualize large datasets using R.
Visualizing big data comes with several challenges that require special consideration:
When too many points overlap in a scatter plot, patterns become obscured. This is particularly problematic with datasets containing millions of observations.
Rendering millions of points can be computationally intensive and slow, both in terms of generation and interactivity.
Large datasets may exceed available memory when creating complex visualizations.
Too much information in a single visualization can make it difficult to extract meaningful insights.
Visualizations need to remain effective and readable at different data scales.
To address these challenges, we employ several strategies:
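Two of these strategies, random sampling and 2D binning, can be sketched in base R on simulated data. The variable names, the sample size of 5,000, and the 50-by-50 grid are illustrative assumptions rather than part of the course material; the point is only to show how each strategy reduces the number of marks that have to be drawn.

```r
# Simulated data standing in for a large dataset (illustrative only)
set.seed(42)
n <- 200000
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Strategy 1: random sampling keeps a fixed number of points
idx <- sample.int(n, 5000)
sampled <- data.frame(x = x[idx], y = y[idx])

# Strategy 2: 2D binning replaces individual points with counts per grid cell
x_bin <- cut(x, breaks = 50)
y_bin <- cut(y, breaks = 50)
bin_counts <- as.data.frame(table(x_bin, y_bin), responseName = "n")
bin_counts <- bin_counts[bin_counts$n > 0, ]  # keep non-empty cells only

# Number of marks to draw under each approach
marks <- c(raw = n, sampled = nrow(sampled), binned = nrow(bin_counts))
print(marks)
```

Sampling discards observations, so rare values may disappear from the plot; binning keeps every observation but shows counts instead of individual points.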
For this training, we’ll use the diamonds dataset, which ships with ggplot2 and contains 53,940 observations of diamond characteristics. While not “big data” by modern standards, it’s large enough to demonstrate visualization challenges and techniques.
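# Load the packages used throughout this section
library(data.table)
library(ggplot2)
library(plotly)
library(RColorBrewer)
library(scales)
library(viridis)
# Load the diamonds dataset (comes with ggplot2)
data(diamonds)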
# Convert to data.table for efficient manipulation
diamonds_dt <- as.data.table(diamonds)
# Create a larger version by sampling with replacement (for demonstration)
set.seed(123)
big_diamonds <- diamonds_dt[sample(.N, 100000, replace = TRUE)]
# Add some derived columns
big_diamonds[, price_per_carat := price / carat]
big_diamonds[, log_price := log10(price)]
big_diamonds[, size_category := cut(carat,
breaks = c(0, 0.5, 1, 1.5, 2, 5),
labels = c("Tiny", "Small", "Medium", "Large", "Very Large"))]
# Save to file for consistency (create the target directory if it is missing)
dir.create("data", showWarnings = FALSE)
fwrite(big_diamonds, "data/big_diamonds.csv")
# Display dataset information
cat("Dataset Information:\n")
## Dataset Information:
## Number of rows: 100,000
## Number of columns: 13
## First few rows:
## Dataset Structure:
## Classes 'data.table' and 'data.frame': 100000 obs. of 8 variables:
## $ carat : num 0.73 0.7 0.31 0.31 0.31 0.83 0.51 0.7 0.4 1.1 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 5 5 5 5 2 3 2 5 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 6 4 1 5 2 2 1 5 2 6 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 5 5 5 7 8 3 4 3 5 3 ...
## $ depth : num 60.7 60.8 61.6 62.2 60.9 63.7 62.5 64.2 61.6 61.2 ...
## $ table : num 56 56 55 56 55 59 58 58 56 61 ...
## $ price : int 2397 3300 713 707 987 3250 1668 1771 1053 4640 ...
## $ x : num 5.85 5.73 4.3 4.34 4.39 5.95 5.12 5.59 4.73 6.61 ...
## - attr(*, ".internal.selfref")=<externalptr>
##
##
## Summary Statistics:
summary_stats <- big_diamonds[, .(
Observations = .N,
Unique_Cuts = length(unique(cut)),
Unique_Colors = length(unique(color)),
Unique_Clarity = length(unique(clarity)),
Avg_Price = mean(price),
Avg_Carat = mean(carat),
Max_Price = max(price),
Min_Price = min(price)
)]
print(summary_stats)
## Observations Unique_Cuts Unique_Colors Unique_Clarity Avg_Price Avg_Carat
## <int> <int> <int> <int> <num> <num>
## 1: 100000 5 7 8 3931.789 0.7974344
## Max_Price Min_Price
## <int> <int>
## 1: 18823 326
Before diving into ggplot2, let’s review base R graphics which can be useful for quick exploratory analysis.
# Set up multi-panel plot
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
# 1. Histogram of price
hist(big_diamonds$price, breaks = 50,
main = "Distribution of Diamond Prices",
xlab = "Price (USD)",
ylab = "Frequency",
col = "steelblue",
border = "white")
# 2. Boxplot of price by cut
boxplot(price ~ cut, data = big_diamonds,
main = "Price by Diamond Cut",
xlab = "Cut Quality",
ylab = "Price (USD)",
col = brewer.pal(5, "Set2"),
outline = FALSE)
# 3. Scatter plot (sampled for clarity)
sample_indices <- sample(nrow(big_diamonds), 1000)
plot(big_diamonds$carat[sample_indices],
big_diamonds$price[sample_indices],
main = "Carat vs Price (1,000 points)",
xlab = "Carat",
ylab = "Price (USD)",
pch = 16,
col = alpha("darkred", 0.3),
cex = 0.8)
# 4. Bar plot of cuts
cut_counts <- table(big_diamonds$cut)
barplot(cut_counts,
main = "Distribution of Diamond Cuts",
xlab = "Cut",
ylab = "Count",
col = brewer.pal(5, "Set3"),
border = NA)
ggplot2 is a powerful visualization package based on the Grammar of Graphics. It provides a consistent and flexible framework for creating complex visualizations.
Every ggplot2 visualization consists of:
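The components usually named in the Grammar of Graphics (data, aesthetic mappings, geometries, scales, facets, labels, and themes) can be added one at a time. The sketch below is a minimal illustration that uses the `diamonds` data frame shipped with ggplot2; the colour choices and bin count are arbitrary assumptions.

```r
library(ggplot2)

# Data and aesthetics: which data frame, and which columns map to x and y
p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price))

# Geometry: how observations are drawn (binned here to limit overplotting)
p <- p + geom_bin2d(bins = 40)

# Scale: how bin counts translate to fill colours
p <- p + scale_fill_gradient(low = "lightblue", high = "darkblue")

# Facets: one panel per cut quality
p <- p + facet_wrap(~cut)

# Labels and theme: non-data elements of the plot
p <- p + labs(x = "Carat", y = "Price (USD)", fill = "Count") + theme_minimal()

# Each `+` only extends the plot object; drawing happens when it is printed
print(p)
```

Because the plot is an ordinary R object until it is printed, layers can be added conditionally or reused across several plots.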
# Basic scatter plot
p_scatter <- ggplot(big_diamonds[sample(.N, 5000)], # Sample for clarity
aes(x = carat, y = price, color = cut)) +
geom_point(alpha = 0.6, size = 1.5) +
labs(title = "Diamond Price vs Carat by Cut",
subtitle = "Sample of 5,000 diamonds",
x = "Carat",
y = "Price (USD)",
color = "Cut Quality") +
scale_color_brewer(palette = "Set2") +
theme_minimal()
print(p_scatter)
# Histogram with density overlay
p_hist <- ggplot(big_diamonds, aes(x = price)) +
geom_histogram(aes(y = after_stat(density)),
bins = 50,
fill = "steelblue",
alpha = 0.7) +
geom_density(color = "darkred", linewidth = 1) +
labs(title = "Distribution of Diamond Prices",
subtitle = "With density curve overlay",
x = "Price (USD)",
y = "Density") +
scale_x_continuous(labels = dollar) +
theme_minimal()
print(p_hist)
# Box plot with jitter
p_box <- ggplot(big_diamonds, aes(x = cut, y = price, fill = cut)) +
geom_boxplot(outlier.alpha = 0.3, outlier.size = 1) +
geom_jitter(alpha = 0.05, width = 0.2, size = 0.5) +
labs(title = "Price Distribution by Diamond Cut",
x = "Cut Quality",
y = "Price (USD)") +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(labels = dollar) +
theme(legend.position = "none")
print(p_box)
# Compare different sampling approaches
set.seed(123)
samples <- list(
random_1k = big_diamonds[sample(.N, 1000)],
random_5k = big_diamonds[sample(.N, 5000)],
stratified = big_diamonds[, .SD[sample(.N, min(200, .N))], by = cut]
)
# Create comparison plots
p_sampling_comparison <- ggplot(samples$random_5k,
aes(x = carat, y = price, color = cut)) +
geom_point(alpha = 0.3, size = 1) +
geom_smooth(method = "lm", se = FALSE, linewidth = 0.5) +
labs(title = "Sampling Strategy: 5,000 Random Points",
subtitle = "With linear trend lines by cut",
x = "Carat",
y = "Price (USD)") +
scale_color_brewer(palette = "Set2") +
facet_wrap(~cut, nrow = 1) +
theme(legend.position = "none")
print(p_sampling_comparison)
# 2D histogram for dense data
p_bin2d <- ggplot(big_diamonds, aes(x = carat, y = price)) +
geom_bin2d(bins = 50) +
scale_fill_viridis(option = "plasma", trans = "log10") +
labs(title = "2D Histogram: Carat vs Price",
subtitle = "Using binning to handle dense data",
x = "Carat",
y = "Price (USD)",
fill = "Count\n(log10)") +
theme_minimal()
print(p_bin2d)
# Hexagonal binning
p_hex <- ggplot(big_diamonds, aes(x = carat, y = price)) +
geom_hex(bins = 40) +
scale_fill_viridis(option = "magma", trans = "log10") +
labs(title = "Hexagonal Binning: Carat vs Price",
subtitle = "Alternative to 2D histogram",
x = "Carat",
y = "Price (USD)",
fill = "Count\n(log10)") +
theme_minimal()
print(p_hex)
# Density plots
p_density <- ggplot(big_diamonds, aes(x = price, fill = cut)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of Prices by Cut",
subtitle = "Shows distribution without individual points",
x = "Price (USD)",
y = "Density",
fill = "Cut") +
scale_fill_brewer(palette = "Set2") +
scale_x_continuous(labels = dollar, limits = c(0, 15000)) +
theme_minimal()
print(p_density)
# Faceted plots
p_facet <- ggplot(big_diamonds[sample(.N, 5000)],
aes(x = carat, y = price, color = color)) +
geom_point(alpha = 0.4, size = 1) +
geom_smooth(method = "lm", se = FALSE, linewidth = 0.5) +
labs(title = "Carat vs Price by Color and Clarity",
subtitle = "Faceted visualization",
x = "Carat",
y = "Price (USD)") +
scale_color_brewer(palette = "Spectral") +
facet_grid(color ~ clarity) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1))
print(p_facet)
# Using stat_summary for aggregation
# (mean_cl_normal requires the Hmisc package to be installed)
p_stat <- ggplot(big_diamonds, aes(x = cut, y = price)) +
stat_summary(fun = mean, geom = "point", size = 3, color = "red") +
stat_summary(fun.data = mean_cl_normal,
geom = "errorbar",
width = 0.2,
color = "red") +
stat_summary(fun = median, geom = "point", size = 3, color = "blue") +
labs(title = "Mean and Median Prices by Cut",
subtitle = "With 95% confidence intervals for means",
x = "Cut Quality",
y = "Price (USD)") +
scale_y_continuous(labels = dollar) +
theme_minimal()
print(p_stat)
library(patchwork)
# Create individual plots
p1 <- ggplot(big_diamonds, aes(x = cut, y = price)) +
geom_boxplot(fill = "lightblue") +
labs(title = "Box Plot", x = NULL, y = "Price")
p2 <- ggplot(big_diamonds, aes(x = price, fill = cut)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot", x = "Price", y = "Density") +
theme(legend.position = "none")
p3 <- ggplot(big_diamonds, aes(x = cut, fill = cut)) +
geom_bar() +
labs(title = "Bar Plot", x = "Cut", y = "Count") +
theme(legend.position = "none")
p4 <- ggplot(big_diamonds[sample(.N, 1000)],
aes(x = carat, y = price, color = cut)) +
geom_point(alpha = 0.5) +
labs(title = "Scatter Plot", x = "Carat", y = "Price") +
theme(legend.position = "none")
# Combine plots
(p1 + p2) / (p3 + p4) +
plot_annotation(title = "Multi-panel Visualization of Diamond Data",
theme = theme(plot.title = element_text(hjust = 0.5, size = 14)))
plotly is an interactive visualization library. It can convert existing ggplot2 objects with ggplotly(), and its WebGL trace types help plots stay responsive with large datasets.
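As a hedged sketch of the WebGL route, the block below draws the same scatter two ways: directly with a `scattergl` trace, and by converting a ggplot2 object and switching its traces to WebGL with `toWebGL()`. The sample size of 20,000 and the marker settings are arbitrary assumptions, and actual responsiveness depends on the browser and graphics hardware.

```r
library(ggplot2)
library(plotly)

# Sample of the diamonds data shipped with ggplot2 (size chosen for illustration)
set.seed(1)
sampled <- diamonds[sample(nrow(diamonds), 20000), ]

# Option 1: a WebGL scatter trace created directly with plot_ly()
p_gl <- plot_ly(sampled, x = ~carat, y = ~price,
                type = "scattergl", mode = "markers",
                marker = list(size = 3, opacity = 0.3))

# Option 2: convert an existing ggplot, then switch its traces to WebGL
p_gg <- ggplot(sampled, aes(carat, price)) + geom_point(alpha = 0.3)
p_gg_gl <- toWebGL(ggplotly(p_gg))
```

Both objects are htmlwidgets and render when printed in an interactive session or an R Markdown document.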
# Create a ggplot
p_gg <- ggplot(big_diamonds[sample(.N, 2000)],
aes(x = carat, y = price,
color = cut, size = depth,
text = paste("Cut:", cut, "<br>",
"Color:", color, "<br>",
"Clarity:", clarity, "<br>",
"Price: $", price))) +
geom_point(alpha = 0.6) +
scale_color_brewer(palette = "Set2") +
labs(title = "Interactive Diamond Explorer",
x = "Carat",
y = "Price (USD)") +
theme_minimal()
# Convert to plotly
p_plotly <- ggplotly(p_gg, tooltip = "text") %>%
layout(hoverlabel = list(bgcolor = "white"),
title = list(text = "Interactive Diamond Explorer<br><sub>Hover for details</sub>"))
# Display
p_plotly
# Create interactive scatter plot directly with plotly
direct_plot <- plot_ly(
data = big_diamonds[sample(.N, 3000)],
x = ~carat,
y = ~price,
color = ~cut,
colors = "Set2",
type = "scatter",
mode = "markers",
marker = list(size = 8, opacity = 0.6),
text = ~paste("Cut:", cut, "<br>",
"Color:", color, "<br>",
"Clarity:", clarity, "<br>",
"Carat:", round(carat, 2), "<br>",
"Price: $", format(price, big.mark = ",")),
hoverinfo = "text"
) %>%
layout(
title = "Interactive Diamond Price Explorer",
xaxis = list(title = "Carat"),
yaxis = list(title = "Price (USD)"),
hoverlabel = list(bgcolor = "white", font = list(size = 12))
)
direct_plot
# 3D scatter plot
plot_3d <- plot_ly(
data = big_diamonds[sample(.N, 2000)],
x = ~carat,
y = ~depth,
z = ~price,
color = ~cut,
colors = "Set2",
type = "scatter3d",
mode = "markers",
marker = list(size = 4, opacity = 0.7),
text = ~paste("Cut:", cut, "<br>Price: $", price),
hoverinfo = "text"
) %>%
layout(
title = "3D Diamond Analysis",
scene = list(
xaxis = list(title = "Carat"),
yaxis = list(title = "Depth (%)"),
zaxis = list(title = "Price (USD)")
)
)
plot_3d
# Interactive histogram with multiple traces
plotly_hist <- plot_ly(alpha = 0.6)
# Add traces for each cut
cuts <- levels(big_diamonds$cut) # character labels, in quality order
colors <- brewer.pal(length(cuts), "Set2")
for (i in seq_along(cuts)) {
plotly_hist <- plotly_hist %>%
add_histogram(
x = big_diamonds[cut == cuts[i]]$price,
name = cuts[i],
marker = list(color = colors[i]),
opacity = 0.6
)
}
plotly_hist <- plotly_hist %>%
layout(
title = "Interactive Price Distribution by Cut",
xaxis = list(title = "Price (USD)"),
yaxis = list(title = "Count"),
barmode = "overlay",
hovermode = "x unified"
)
plotly_hist
# Create subplots with linked brushing
# Both panels draw from one keyed sample so a selection can be matched across them
linked_data <- highlight_key(as.data.frame(big_diamonds[sample(.N, 2000)]))
fig1 <- plot_ly(
data = linked_data,
x = ~carat,
y = ~price,
color = ~cut,
type = "scatter",
mode = "markers",
source = "A"
)
fig2 <- plot_ly(
data = linked_data,
x = ~depth,
y = ~table,
color = ~cut,
type = "scatter",
mode = "markers",
source = "A"
)
linked_plot <- subplot(fig1, fig2, titleX = TRUE, titleY = TRUE) %>%
layout(
title = "Linked Brushing: Select points in one plot to highlight in the other",
showlegend = FALSE
) %>%
highlight(
on = "plotly_selected",
off = "plotly_deselect",
persistent = FALSE
)
linked_plot
# Create a dashboard with multiple plots
library(plotly)
# Plot 1: Scatter
p1 <- plot_ly(big_diamonds[sample(.N, 1000)],
x = ~carat, y = ~price,
color = ~cut, type = "scatter", mode = "markers",
marker = list(size = 6, opacity = 0.7)) %>%
layout(xaxis = list(title = "Carat"),
yaxis = list(title = "Price"))
# Plot 2: Box plot
p2 <- plot_ly(big_diamonds,
x = ~cut, y = ~price,
color = ~cut, type = "box") %>%
layout(xaxis = list(title = "Cut"),
yaxis = list(title = "Price"),
showlegend = FALSE)
# Plot 3: Histogram
p3 <- plot_ly(x = big_diamonds$price,
type = "histogram",
marker = list(color = "steelblue")) %>%
layout(xaxis = list(title = "Price"),
yaxis = list(title = "Count"))
# Plot 4: Bar chart
cut_counts <- big_diamonds[, .N, by = cut]
p4 <- plot_ly(cut_counts,
x = ~cut, y = ~N,
type = "bar",
marker = list(color = brewer.pal(5, "Set3"))) %>%
layout(xaxis = list(title = "Cut"),
yaxis = list(title = "Count"))
# Combine into dashboard
dashboard <- subplot(p1, p2, p3, p4,
nrows = 2,
shareX = FALSE,
shareY = FALSE,
titleX = TRUE,
titleY = TRUE) %>%
layout(title = "Diamond Data Dashboard",
showlegend = TRUE)
dashboard
Let’s compare the performance and features of different visualization approaches.
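One way to keep such timings comparable is to render to an off-screen device, so that no window or file is produced while the drawing code runs. The sketch below uses base R only and simulated data; the helper name `time_render` and the choice of three repetitions are assumptions made for illustration.

```r
# Simulated data (illustrative; not the course dataset)
set.seed(1)
n <- 50000
x <- runif(n)
y <- x + rnorm(n, sd = 0.2)

# Hypothetical helper: median elapsed seconds for a drawing function
time_render <- function(draw, reps = 3) {
  elapsed <- numeric(reps)
  for (i in seq_len(reps)) {
    pdf(NULL)                                   # open an off-screen device
    elapsed[i] <- system.time(draw())[["elapsed"]]
    dev.off()                                   # close it again
  }
  median(elapsed)
}

# Draw every point versus draw a 50 x 50 grid of counts
t_points <- time_render(function() plot(x, y, pch = "."))
t_binned <- time_render(function() {
  counts <- table(cut(x, 50), cut(y, 50))
  image(unclass(counts))
})
print(c(points = t_points, binned = t_binned))
```

Timings vary by machine and graphics device, so they are best read as relative comparisons within one session.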
# Function to time plot creation
time_plot_creation <- function(plot_func, iterations = 10) {
times <- numeric(iterations)
for (i in 1:iterations) {
start_time <- Sys.time()
plot_func()
times[i] <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))
}
return(mean(times))
}
# Define plot functions
base_r_plot <- function() {
par(mfrow = c(1, 1))
pts <- big_diamonds[sample(.N, 5000)] # sample once so carat and price stay paired
plot(pts$carat,
pts$price,
main = "Base R Plot",
xlab = "Carat", ylab = "Price",
pch = 16, col = alpha("blue", 0.3))
}
ggplot_plot <- function() {
p <- ggplot(big_diamonds[sample(.N, 5000)],
aes(x = carat, y = price)) +
geom_point(alpha = 0.3, color = "blue") +
labs(title = "ggplot2 Plot",
x = "Carat", y = "Price")
print(p)
}
# Time the plots
set.seed(123)
base_time <- time_plot_creation(base_r_plot, 5)
ggplot_time <- time_plot_creation(ggplot_plot, 5)
# Create comparison data
comparison_data <- data.frame(
Method = c("Base R", "ggplot2", "plotly (static)", "plotly (interactive)"),
# plotly speeds are rough multiples of ggplot_time used as placeholders, not measurements
Speed = c(base_time, ggplot_time, ggplot_time * 1.5, ggplot_time * 2),
Interactivity = c("None", "None", "High", "High"),
Aesthetics = c("Basic", "Excellent", "Excellent", "Excellent"),
Learning_Curve = c("Low", "Medium", "Medium", "Medium"),
Best_For = c("Quick exploration", "Publication plots",
"Web dashboards", "Interactive reports")
)
# Display comparison table
DT::datatable(comparison_data,
options = list(pageLength = 10,
dom = 't',
columnDefs = list(
list(className = 'dt-center', targets = 1:5)
)),
rownames = FALSE,
caption = "Visualization Method Comparison")
# Create a visualization of feature comparison
# (scores are illustrative ratings, not measured values)
features <- data.frame(
Feature = rep(c("Speed", "Customization", "Interactivity",
"3D Support", "Animation", "Export Quality"), 3),
Score = c(9, 6, 2, 3, 1, 7, # Base R
7, 9, 3, 4, 4, 9, # ggplot2
5, 8, 10, 9, 9, 8), # plotly
Method = rep(c("Base R", "ggplot2", "plotly"), each = 6)
)
p_features <- ggplot(features, aes(x = Feature, y = Score, fill = Method)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7) +
geom_text(aes(label = Score),
position = position_dodge(width = 0.7),
vjust = -0.5, size = 3) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Visualization Method Feature Comparison",
subtitle = "Score out of 10 for each feature",
x = NULL,
y = "Score",
fill = "Method") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p_features)
Base R Graphics:
- Quick exploratory data analysis
- Simple diagnostic plots
- When working in minimal environments
- For very basic, no-frills visualizations
ggplot2:
- Publication-quality static graphics
- Complex multi-layered visualizations
- When consistency across plots is important
- For detailed customization and theming
plotly:
- Interactive web applications and dashboards
- When users need to explore data dynamically
- For 3D visualizations
- When sharing visualizations online
# 1. Use sampling for large datasets
ggplot(big_diamonds[sample(.N, 10000)], aes(carat, price)) + geom_point()
# 2. Use binning for dense data
ggplot(big_diamonds, aes(carat, price)) + geom_bin2d(bins = 100)
# 3. Use density plots instead of scatter plots
ggplot(big_diamonds, aes(price, fill = cut)) + geom_density(alpha = 0.5)
# 4. Avoid overplotting with transparency
ggplot(big_diamonds, aes(carat, price)) + geom_point(alpha = 0.1)
# 5. Use efficient geometries
# geom_hex() is often faster than geom_point() for large data
# 1. Limit data points for scatter plots
plot_ly(big_diamonds[sample(.N, 10000)], x = ~carat, y = ~price,
type = "scatter", mode = "markers")
# 2. Use WebGL for very large datasets
plot_ly(big_diamonds, x = ~carat, y = ~price,
type = "scattergl", mode = "markers")
# 3. Aggregate data before plotting
aggregated <- big_diamonds[, .(mean_price = mean(price)), by = carat]
plot_ly(aggregated, x = ~carat, y = ~mean_price,
type = "scatter", mode = "markers")
# 4. Use server-side processing for massive datasets
# Consider shiny or dash applications
# Function to monitor memory during visualization
monitor_viz_memory <- function(viz_func, func_name) {
mem_before <- pryr::mem_used()
viz_func()
mem_after <- pryr::mem_used()
return(data.frame(
Method = func_name,
Memory_Used_MB = round((mem_after - mem_before) / 1024^2, 2),
Memory_Used_GB = round((mem_after - mem_before) / 1024^3, 3)
))
}
# Test memory usage
memory_results <- rbind(
monitor_viz_memory(base_r_plot, "Base R"),
monitor_viz_memory(ggplot_plot, "ggplot2")
)
print(memory_results)
## Method Memory_Used_MB Memory_Used_GB
## 1 Base R 0.07 0.000
## 2 ggplot2 1.24 0.001
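The measurement above captures the change in session memory around a plotting call. A complementary check, sketched here in base R on simulated data, is the size of the object handed to the plotting function before and after aggregation. The helper `size_mb` and the 100-by-100 grid are assumptions made for illustration.

```r
# Simulated raw data with one million rows (illustrative only)
set.seed(1)
n <- 1e6
raw <- data.frame(x = rnorm(n), y = rnorm(n))

# Aggregate to a 100 x 100 grid of counts before plotting
binned <- as.data.frame(
  table(x = cut(raw$x, 100), y = cut(raw$y, 100)),
  responseName = "count"
)

# Hypothetical helper: object size in megabytes
size_mb <- function(obj) as.numeric(object.size(obj)) / 1024^2

print(c(raw_mb = size_mb(raw), binned_mb = size_mb(binned)))
```

The aggregated table has a fixed number of rows regardless of how many observations went into it, which is why pre-aggregation scales better than passing raw rows to a plotting library.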
Create a comprehensive visualization dashboard that includes:
Take a large dataset (or simulate one with 1 million rows) and:
Using the diamonds dataset, create an interactive dashboard with:
This material is part of the training program by The National Centre for Research Methods © NCRM authored by Dr Somnath Chaudhuri (University of Southampton). Content is under a CC BY‑style permissive license and can be freely used for educational purposes with proper attribution.